AITopics | markovian reward

A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed. Reward redistribution serves as a solution to re-assign credits for each time step from observed sequences. While the majority of current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution and preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution by characterizing the generation of Markovian rewards and trajectory-wise long-term return and further propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and causal relations in the generative process. Then, GRD makes use of the identified causal generative model to form a compact representation to train policy over the most favorable subspace of the state space of the agent. Theoretically, we show that the unobservable Markovian reward function is identifiable, as well as the underlying causal structure and causal models.

interpretable reward redistribution, name change, reinforcement learning, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.59)
Information Technology > Artificial Intelligence > Machine Learning (0.42)


Learning Task Specifications from Demonstrations

Marcell Vazquez-Chanlatte, Susmit Jha, Ashish Tiwari, Mark K. Ho, Sanjit Seshia

Neural Information Processing Systems, Nov-20-2025, 17:29:29 GMT

In many settings (e.g., robotics) demonstrations provide a natural way to specify a task. For example, an agent (e.g., human expert) gives one or more demonstrations of the task from which we seek to automatically synthesize a policy for the robot to execute. Typically, one models the demonstrator as episodically operating within a dynamical system whose transition relation only depends on the current state and action (called the Markov condition). However, even if the dynamics are Markovian, many problems are naturally modeled in non-Markovian terms (see Ex 1).

logic & formal reasoning, machine learning, specification, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > Canada > Quebec > Montreal (0.04)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.46)


A Broader Impacts

Neural Information Processing Systems, Oct-8-2025, 13:17:20 GMT

Markov Condition, are not independent of each other.

cau, dimension, markovian reward, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


402e12102d6ec3ea3df40ce1b23d423a-Paper-Conference.pdf

Neural Information Processing Systems, Oct-8-2025, 13:17:17 GMT

causal structure, dimension, markovian reward, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Netherlands > North Brabant > Eindhoven (0.04)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)


Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach

Neural Information Processing Systems, Oct-11-2024, 14:25:10 GMT

A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed. Reward redistribution serves as a solution to re-assign credits for each time step from observed sequences. While the majority of current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution and preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution by characterizing the generation of Markovian rewards and trajectory-wise long-term return and further propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and causal relations in the generative process.

causal approach, interpretable reward redistribution, reinforcement learning, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.64)


Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach

Zhang, Yudi, Du, Yali, Huang, Biwei, Wang, Ziyan, Wang, Jun, Fang, Meng, Pechenizkiy, Mykola

arXiv.org Artificial Intelligence, Nov-10-2023

A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed. Reward redistribution serves as a solution to re-assign credits for each time step from observed sequences. While the majority of current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution and preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution by characterizing the generation of Markovian rewards and trajectory-wise long-term return and further propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and causal relations in the generative process. Then, GRD makes use of the identified causal generative model to form a compact representation to train policy over the most favorable subspace of the state space of the agent. Theoretically, we show that the unobservable Markovian reward function is identifiable, as well as the underlying causal structure and causal models. Experimental results show that our method outperforms state-of-the-art methods and the provided visualization further demonstrates the interpretability of our method. The project page is located at https://reedzyd.github.io/GenerativeReturnDecomposition/.
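The idea of redistributing a trajectory-wise return into per-step Markovian rewards can be illustrated with a much simpler stand-in than GRD. The sketch below is not the authors' method: it assumes a linear reward model over hypothetical state-action features and fits it by ordinary least squares so that the per-step proxy rewards sum to each observed return. The names `fit_linear_reward`, `redistribute`, and the synthetic data are illustrative assumptions only; GRD itself additionally learns causal structure, which is not modeled here.

```python
import numpy as np

def fit_linear_reward(trajectories, returns):
    """Fit w so that sum_t (w . phi_t) approximates each trajectory's return.

    Assumption: rewards are linear in per-step features phi_t. This is a
    simplification for illustration, not the GRD model.
    """
    # Summing features over time turns each trajectory into one regression row.
    X = np.stack([np.asarray(traj).sum(axis=0) for traj in trajectories])
    y = np.asarray(returns, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def redistribute(traj, w):
    """Return per-step proxy rewards for one trajectory."""
    return np.asarray(traj) @ w

# Synthetic delayed-reward data: only the total return of each episode is observed.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])            # hidden per-step reward weights
trajectories = [rng.normal(size=(10, 3)) for _ in range(50)]
returns = [float((traj @ true_w).sum()) for traj in trajectories]

w_hat = fit_linear_reward(trajectories, returns)
proxy = redistribute(trajectories[0], w_hat)   # one proxy reward per time step
```

In this noiseless linear setting the hidden weights are recoverable from returns alone, which loosely mirrors the identifiability claim in the abstract, although under far stronger assumptions than the paper's.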

causal structure, dimension, markovian reward, (12 more...)

arXiv.org Artificial Intelligence

2305.18427

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Netherlands > North Brabant > Eindhoven (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)


Hellinger KL-UCB based Bandit Algorithms for Markovian and i.i.d. Settings

Roy, Arghyadip, Shakkottai, Sanjay, Srikant, R.

arXiv.org Machine Learning, Sep-14-2020

In the regret-based formulation of multi-armed bandit (MAB) problems, except in rare instances, much of the literature focuses on arms with i.i.d. rewards. In this paper, we consider the problem of obtaining regret guarantees for MAB problems in which the rewards of each arm form a Markov chain which may not belong to a single parameter exponential family. To achieve logarithmic regret in such problems is not difficult: a variation of standard KL-UCB does the job. However, the constants obtained from such an analysis are poor for the following reason: i.i.d. rewards are a special case of Markov rewards and it is difficult to design an algorithm that works well independent of whether the underlying model is truly Markovian or i.i.d. To overcome this issue, we introduce a novel algorithm that identifies whether the rewards from each arm are truly Markovian or i.i.d. using a Hellinger distance-based test. Our algorithm then switches from using a standard KL-UCB to a specialized version of KL-UCB when it determines that the arm reward is Markovian, thus resulting in low regret for both i.i.d. and Markovian settings.
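Two building blocks named in the abstract, the standard KL-UCB index and the Hellinger distance, can be sketched as follows. This is a minimal illustration assuming Bernoulli rewards for the index and finite discrete distributions for the distance. It does not reproduce the paper's test statistic, thresholds, or its specialized Markovian variant of KL-UCB; the function names and the exploration constant `c` are assumptions made for the example.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, c=0.0, iters=50):
    """Largest q >= mean with pulls * KL(mean, q) <= log t + c * log log t.

    Found by bisection, since KL(mean, q) is increasing in q for q >= mean.
    """
    budget = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid   # mid is still inside the confidence region
        else:
            hi = mid
    return lo

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

# An arm with empirical mean 0.5: the index shrinks toward the mean as pulls grow.
few = kl_ucb_index(0.5, pulls=10, t=100)
many = kl_ucb_index(0.5, pulls=100, t=100)
```

One way such a distance could be used is to compare an arm's empirical transition rows with its empirical marginal: under i.i.d. rewards the two should be close. Whether the paper's test takes exactly this form is not stated in the abstract.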

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

2009.06606

Country: